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Abstract 

For many complex diseases, prognosis is of essential importance. It has been shown that, be¬ 
yond the main effects of genetic (G) and environmental (E) risk factors, the gene-environment 
(GxE) interactions also play a critical role. In practice, the prognosis outcome data can be 
contaminated, and most of the existing methods are not robust to data contamination. In the 
literature, it has been shown that even a single contaminated observation can lead to severely 
biased model estimation. In this study, we describe prognosis using an accelerated failure time 
(AFT) model. An exponential squared loss is proposed to accommodate possible data contam¬ 
ination. A penalization approach is adopted for regularized estimation and marker selection. 

The proposed method is realized using an effective coordinate descent (CD) and minorization 
maximization (MM) algorithm. Simulation shows that without contamination, the proposed 
method has performance comparable to or better than the unrobust alternative. With contami¬ 
nation, it outperforms the unrobust alternative and, under certain scenarios, can be superior to 
the robust method based on quantile regression. The proposed method is applied to the analysis 
of TCGA (The Cancer Genome Atlas) lung cancer data. It identifies interactions different from 
those using the alternatives. The identified marker have important implications and satisfactory 
stability. 

Keywords: Gene-environment interaction, prognosis, robustness, exponential squared loss, penal¬ 
ization, marker identification. 
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Introduction 


For many complex diseases such as cancer, diabetes, and cardiovascular diseases, prognosis is of 
essential interest. In the omics era, profiling studies have been extensively conducted, searching for 
genetic markers associated with prognosis. It has been suggested that, beyond the main effects of 
genetic (G) and environmental (E) risk factors, the gene-environment (GxE) interactions also have 
important implications. Mul tiple statistical method s have b e en de v eloped for Gx E interaction 
analysis. For reviews, refer to Caspi h Moffitt ( 20od ). Cordell f 20091 ") . Thomasl ( 201C)h . and others. 

Denote T as the prognosis time of interest, X = {X\,... ,X q ) as the q environmental/clinical 
variables, and Z = (Zi,...,Z p ) as the p genetic variables. Assume n independent subjects. 
Regression-based interaction analysis, with its broad applicability, has been extensively adopted 
and proceeds as follows, (a) For gene j{= 1 ,p), consider the model T ~ o + Ylk=i Xk&j,k + 

Zj/3j + Z) Xklj,k)-, where </>(•) is the known link function (for example, the Cox or exponential 

model), and ayo, oy.fc, /3j, Tj.fe are the unknown regression coefficients. As usually q « n , this is a 
low-dimensional model and can be fitted using standard, usually likelihood-based, techniques and 
software. Denote pj.k as the p-value for 7 jj.. (b) With {pj.k ■ j = 1 ,p\k = 1,..., q}, conduct 
multiple comparison adjustment using the Bonferroini or FDR (false discovery rate) approach, and 
identify important interactions. 

A limitation of the above approach is a lack of robustness properties. Usually it is assumed that 
all subjects satisfy the same prognosis models. In practice, most genetic studies cannot afford to 
condu ct strict subject selection . Seemingly homogeneous subj ects can have different disease sub- 


types (jHaibe-Kains et al.l . l2012l ) and different survival patterns (iBurgessl. 1201 ill. Cause o f death can 
be misclassified, leading to contamination in disease-specific sur vival (Fall et all 12008 1. The sur¬ 


vival times extracted from medical records are not always reliable ( Bowmanl ( 2011 1. Rampatiee et al. 
( 201.11 ) ). With unrobust for example likelihood- based estimation, even a si ngle contaminated obser¬ 
vation can result in severely biased estimates ( Huber fc Ronchetti . 20091 ). which can lead to false 
marker identification. Another limitation is that, significance level-based identification, although 
asymptotically valid, may generate unreliable results when the sample sizes are small to moderate, 
as in typical profiling studies. A recent study suggests that regulari zed est imation can lead to more 


reliable estimation and hence more accurate marker identification (jShi et ah. 20141) 


With low-dimensional biomedical data, r obust statisti cal methods have demonstrated great 
power. As suggested in a recent review by Wu fe Ma ( 2014h . with genetic data, the development is 
limited and unsystematic and has been most l y on th e an alysis of main effects but not interactions. 
In the literature, relevant studies include Gui et al. ( 201 il l, which identifies important interactions 
using the multifactor dimensionality reduction (MDR) technique. H owev er, this method is limited 
to categorial data (such as SNPs) and not broadly applicable. Shi et al. ( 2014 ) developed a rank- 
based method, which is r obust to m odel mis-specihcation but not data contamination. The most 
relevant study is perhaps IWang et al.l (12015!), which developed a quantile-regression based method. 
With that method, the quantile needs to be specified, which is not a trivial task in practical data 
analysis. The objective function is not differentiable, causing difficulty in estimation. In addition, 
as to be shown in this study, its numerical performance can be less satisfactory under many data 
settings. With low-dimensional data, it has been shown that no robust method dominates the 
others. It is thus prudent to develop alternative robust methods. 

Consider prognosis data with both G and E measurements. The goal is to develop a new method 
for identifying important GxE interactions. Significantly advancing from the existing studies, the 
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proposed method is robust to contamination in the prognosis data. In addition, our simulation 
suggests that, under certain scenarios, its numerical performance is better than the q uant ile regres¬ 
sion analysis, which is perhaps the most popular robust method for genetic data (IWu fc Mai . 120141 ). 
A penalization approach is adopted for marker identification, which differs from the significance 
level-based approach and can have better numerical performance when the sample size is small to 
moderate. 


Robust identification of interactions 


Data and model settings 

In the literature, there are multiple prognosis models. In this study, we cons ide r the accelerated 
failure time (AFT) model, which has been adopted in Shi et al. ( 20141 ) . Liu et al. ( 20131 ). and many 
others. Compared to alternatives such as the Cox model, the AFT model can be preferred because 
of its more intuitive interpretations and lower computational cost, which are especially desirable 
with high-dimensional genetic data. With a slight abuse of notation, still T to denote the logarithm 
of prognosis time. For gene j, the AFT model assumes that 


T = a 


To 


+ 22 XkOtj^k + Zj[3j + Zj 22 Xk'y.j,k + U 


k=1 


k=1 


where e is the random error. 

Assume n independent observations. Consider the scenario where a small subset of the random 
errors are contaminated, leading to contamination in the prognosis times. In a typical profiling 
study, q « n, while p can be comparable to or much larger than n. For subject i, denote C) 
as the logarithm of censoring time and x, and Zj as the observed X and Z values, respectively. 
Under right censoring, we observe (jji = min(Ti,Ci),Si = 1(1) < C)),Xj,Zj). Further denote n t ,j = 
(1, x j, %i,j, • ■ ■ 1 z i,j x i,q) j Cj = ( a j,(h a j,li ■ ■ ■ j a j,qi Pji ■ ■ ■ > Tj,g) > and XJj = ( u l J , . . . , U n j) 

which is a n x (2 q + 2) matrix. For gene j and subject i, the AFT model can now be written as 
Tj = u JjCj + e i,j- Without loss of generality, assume that the data {(yi, Si, Xj, Zj), i = 1,..., n} have 
been sorted according to yi s from the smallest to the largest. 


Penalized robust identification 


A penalized marker identification method is defined by its loss function and penalty, which are 
defined separately as follows. 

Loss function Before defining the robust loss function, we take one step back and consider the 
scenario with no data contamination. When the distribution of e is not specified, likelihood- 
based estimation c annot be adopted. A popular approach is the weighted least squared estimation 
proposed in Stute ( 19931 ) and proceeds as follows. First compute the Kaplan-Meier weights as 


Wl = 


Si 


i —1 


Ui = 


n 


n — % + 


n-j 
'' n - j + 1 


in 


) Sj , i = 2, ,n. 
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For gene j{= 1 ,,p), the weighted least squared objective function is defined as 

n 

-ujjtj) 2 , ( 1 ) 

i— 1 

This loss function has the following properties. With a simple linear regression model, the most 
commonly adopted loss function is the least squared one. Here to accommodate censoring, a weight 
function is imposed to re-weigh different observations according to their observed times and event 
status. When there is no censoring, cjj = 1/n. With the quadratic form, the loss function is not 
robust to data contamination. If subject i is not censored, then cu* ^ 0, and an arbitrarily large 
yi results in arbitrarily large estimates. Biased estimation can happen with data contamination, 
which may lead to false marker identification. 

To accommodate data contamination, for gene j(= 1,... ,p), we propose the exponential squared 
loss function 


Qe(Cj\y,Uj,uj) = ^cj i exp(-( 2 /j - uJjCjf/e). 


(2) 


i=1 


y and u denote the vectors composed of y^s and Ui s, respectively. 9 > 0 is a tuning parame¬ 
ter. The rationale of this approach is as follows. For contaminated subjects with yjS deviate from 
uJjQs (predicted values from the model), (yi — uJjQ) 2 s have large values. The exponential function 
down-weighs such contaminated observations. The degree of down-weighing is adjusted by 9: with 
9 getting smaller, contaminated observations have smaller influence. To accommodate censoring, 
uji s are imposed in a similar manner as the original Stute’s approach. For low-dimensional linear 


regres sion model without censoring, the exponential squared loss has been examined in Wang et al 


( 2013l l. Different from the existing studies, here we consider the more challenging high-dimensional 
genetic data especially interactions. In addition, the Kaplan-Meier weights are introduced to ac¬ 
commodate censoring. As to be shown in Appendix, such differences lead to significant differences 
in the statistical development. 

Penalized estimation For gene j{= 1, , we propose the penalized objective function 


L\,e(Cj\y,Uj,uj) = Qe(Cj\y, U/,u>) - A||Cj||i- 


(3) 


A > 0 is the data-dependent tuning parameter, and 11 • 11 1 is the t\ norm. Denote Cj as the maximizer 
of Uj, ca). Interactions (and main effects) corresponding to the nonzero components of 

Cj are identified as important. 

Th e strategy of using penalization for identifying important interactions has been adopted 


m 


Shi et al. ( 2014 ) and other studies. Here a Lasso penalty is imposed, which is the most commonly 


adopted penalty. A controls the degree of sparsity and number of identified interactions. We impose 
the same A to all genes to ensure comparability. 


Multip le p enalties can take place of Lasso. In some recent studies such as Bien et al. ((2013) 
and Liu et al. ( 2013 1. penalties have been developed to respect the “main effects, interactions” 
hierarchy, which reinforces that the main eff ects corresponding t o the i d entified interactions mu st 
be identified. In a large number of studies ( Zimmermann et al. ( 2011 ). Caspi fc Moffitt ( 200(tI )). 
it has been observed that genes can have important GxE interactions but no main effects. In 
addition, if the hierarchy has to be reinforced, we can identify the important interactions first, and 
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then add back the corresponding main effects. Computationally, Lasso is much simpler than the 
existing alternatives. Our limited experience suggests that with the complex robust loss function, 
complex penalties have a considerable probability of running into convergence problems. With the 
above considerations, the Lasso penalty is adopted here. 

Consistency properties A significant advantage of the proposed method is that it has the much 
deserved consistency properties under the ultrahigh dimensional setting. For many of the existing 
interaction analysis methods, there is a lack of theoretical development. The rigorously established 
consistency properties provide a solid statistical ground for the proposed method. Details are 
provided in Appendix. 


Computation 

We propose a coordinate-wise updating procedure to compute the solution to ([3]). For low¬ 


dimensional data under simpler settings, an iterative approach is suggested in Wang et ah ( 20131 ) 
to select the robust tuning parameter 0. However, under the present high-dimensional settings and 
with the coordinate-wise updating procedure, such an approach is computationally infeasible. Al¬ 
ternatively, for each (A, 6) pair, we compute the solution to each marginal model. This way, we can 
obtain a solution surface over the two-dimensional tuning parameter grid. Then a prediction error 
based method such as cross-validation can be used to select the robust and penalization tuning 
parameters. Computation is conducted for each gene separately and can be realized in a highly 
parallel manner to reduce computer time. 

Consider gene j(= 1, ... ,p). Let ri(Cj) = Vi — u JjCj- The first and second order derivatives of 

Q(Cj) are 


QkiCj) = = 2 Er=i w ^(i,i) fe D(Ci)exp(-rf(Cj)/0)/6', 

QkliCj ) = dCj!kdl],i = 2 Si=l u} i u (i,j)k u (,hj)i ex P (— r i(Cj)/Q)(2 r i(Cj)/@ — l)/0- 


(4) 


For a Cj 1 that is in a small neighborhood of Q, Q(Cj) in © can be locally approximated by 

Q(Cj) ~ <3(0 + 0(O T fe - o + t(Cj - O t <3(O(0 - o- < 5 > 

Replacing Q(Cj) in © with this approximation and taking the first order derivative of L(Cj) with 
respect to the fcth element give us 

0 +1 = O - 0i'(O<3*(O + <3tt(OA- (6) 


Note that when 2rf(Cj)/6 > 1, Qkk(Cj) > 0. Then ([3]) can be maximized at infinity, and the 
algorithm may fail to converge. To avoid this problem, notice that 

n 

QkkiCj) > -2^w^ J)fc ex p(-r-f(Cj)/0)/0 

i= 1 
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for Ui > 0. The right hand side is always non-positive. We re-define 

n 

QkkiCj ) = exp(—r-?(Cj)/^)/^, 

i= 1 

and use the minorization-maximization (MM) algorithm to ensure convergence. The algorithm that 
combines the coordinate descent and MM algorithm is summarized in Algorithm [T] 


Algorithm 1: The coordinate descent MM algorithm for marginal model j 
input : y, and tuning parameters A and 9. 

output: the maximizer Cj defined in Q. 

initialization: let m = 0, Cj = 0. Normalize y and U j such that = 0) 

Ya =i = 0 and E”=i ^ u \i,j) k = n for k = 1,..., 2q + 2.; 

repeat 

777*0 = fn'i 

for k = 1,2,..., 2q + 2 do 

repeat 

C +1 = QJ - (Cf )«*(<:”) + Oi‘(Cr)A; 
q;; 1 -Gj for' - 1 * 2 -...* 2 !,- 2 *// 1; 

777 = 777 + 1 ; 

untii |C™ fc — CJfc X | — a predefined thresholding value. In our numerical study, we set 
the thresholding =10 -3 ; 

end 

until ||C™° - Cj^lh ^ a predefined thresholding value ; 
return as the maximizer. 


As mentioned above, we search over a two-dimensional grid of (A, 0 ) for the optimal. The 
range of A is determined as follows. First, its upper bound A max is selected such that Cj = 0 fo r 
j = 1,... ,p. With the unrobust weighted least squared loss, A max = max^ =1 {||UjVFy||oo} where 
W is the diagonal matrix composed of WjS. With the proposed robust method, the derivatives in J4]) 
can be viewed as a weighted sum of Because the weight for each subject changes with 

Cj, the previously defined A max may not guarantee that Cj = 0 for j. After some trials, we find that 
Amax = 20 max^ =1 {||UjVFyUoo} is in general a safe upper bound for A. The lower bound is chosen 
as A m i n = Amax/1000. The range of 6 depends on the degree of contamination in data, which is not 
known in practice. In our numerical studies, we find that the estimator is not very sensitive to the 
value of 6 . For cautions, we choose a relatively wide range for 0 . Specifically, after centralization, 
we choose 9 € (min™ =1 y?/100, max” =1 yj x 100). 

With the quadratic approximation and minorization, optimization in d3j) is non-convex. There¬ 
fore, there is a risk of converging to local maximums. Thus when trying to solve the estimate at a 
new (A, 9) point, instead of using C ( ] = 0 as the starting value, we use the estimate at a neighboring 
(A, 9) point as the warm starting value. This modification speeds up the computation and also 
improves the convergence property. 
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Simulation 


In simulation, we set n = 300, q = 3, and p = 500,1000. There are a total of 18 nonzero effects: 
3 main effects of E factors, 5 main effects of G factors, and 10 interactions. The positions of 
nonzero main G effects and interactions are uniformly placed. The nonzero regression coefficients 
are randomly generated from uniform( 0.5,1.5). The E and G factors are generated from multi¬ 
variate normal distributions with marginal means zero, marginal variances one, and the following 
variance matrix structures: Independent, AR(0.2), AR(0.8), Band(0.3), and Band(0.6). Under 
the AR(p) correlation structure, for the ith and j th factors, corr = p^~^. Under the Band(/o) 
correlation structure, for the zth and jth factors, corr = pl(\i — j\ < 2), where /(•) is the in¬ 
dicator function. Under each correlation scenario, consider seven different distributions for the 
random error e: N(0,1), 0.95N(0,l)+0.05Cauchy, 0.85N(0,l)+0.15Cauchy, 0.7N(0,l)+0.3Cauchy, 
0.95N(0,l)+0.05t(3), 0.85N(0,l)+0.15t(3), and 0.7N(0,l)+0.3t(3). That is, we consider the scenar¬ 
ios with no contamination and three different levels of contamination. Two contamination distri¬ 
butions are considered with different thickness of tails. The log event times are generated from 
the AFT models. The censoring times are generated independently from exponential distributions. 
The parameters are adjusted so that the censoring rates are about 25%. 

Beyond the proposed method (referred to as “Robust”), we also consider the following three 
alternatives: (a) the “Unrobust” method, which adopts the weighted least squared loss function and 
applies the Lasso penalization for selecting important effects; (b) the “Stute” method, which adopts 
the weighted least squared loss function, does not apply any penalization, and uses the significance 
level (p-value) as the criterion for quantifying the importance of effects; and (c) the “Quantile” 
method, which adopts the quantile regression-based robust loss function and applies the Lasso 
penalization for selecting important effects. The Unrobust and Stute methods, as the proposed, are 
built on the weighted least squared estimation. Comparing with them can establish the advantage of 
robust loss and penalization-based identification, respectively. The Quantile method is also robust 
and has been recently developed. Comparing with it can establish the advantage of the weighted 
exponential squared loss. We acknowledge that multiple methods are potentially applicable to the 
simulated data. The three alternatives have frameworks closest to that of the proposed. 

With the Robust, Unrobust, and Quantile methods, the numbers of selected interactions de¬ 
pend on the tuning parameter values. With the Stute method, the number depends on the p-value 
cutoff. To eliminate the effect of tuning parameter selection on identification, we examine a se¬ 
quence of tuning parameter values, evaluate identification performance at each value, and use the 
ROC (receiver operating characteristics) curve to evaluate the interaction identification accuracy 
of different methods. A representative ROC plot is shown in Figure 1. In this plot, the proposed 
method has the dominatingly better accuracy. 

Summary AUCs based on 100 replicates are shown in Tables 1 (p = 500) and 2 (p = 1000). 
When there is no contamination, performance of the proposed method is comparable to or slightly 
worse than that of the unrobust method. For example with p = 500 and the independence corre¬ 
lation structure, the robust and unrobust methods have mean AUCs 0.861 and 0.889, respectively. 
And with the AR(0.2) correlation, they have mean AUCs 0.901 and 0.881, respectively. With con¬ 
tamination, the robust method outperforms the unrobust method. For example with p = 500, the 
AR(0.2) correlation structure, and 0.7N(0,l)+0.3Cauchy error, the robust and unrobust methods 
have mean AUCs 0.886 and 0.751, respectively. Under all simulation scenarios, the Stute method, 
which adopts the robust loss function but significance level-based identification, has inferior iden- 
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tification accuracy. As has been suggested in the literature, with a moderate sample size, the 
unregularized estimates can be less reliable, leading to inaccurate identification. When comparing 
the proposed method with the quantile regression-based, we see that under the majority of the 
settings, the proposed method has superior performance. For example with p = 500, AR(0.2) cor¬ 
relation, and 0.85N(0,l)+0.15Cauchy error, the proposed method and Quantile have mean AUCs 
0.892 and 0.842, respectively. However, under a small number of settings, the Quantile method ex¬ 
cels. For example withp = 500, AR(0.8) correlation, and 0.7N(0,l)+0.3t(3) error, the two methods 
have mean AUCs 0.863 and 0.933, respectively. Such observations are also reasonable. No robust 
approach is expected to be able to dominate all others. The proposed method outperforms the 
quantile regression-based under most scenarios and provides a useful alternative. 


Analysis of the lung squamous cell carcinoma data 


Lung squamous cell carcino ma is the second most co mmon lung cancer and caus es around 400,000 
deaths each year worldwide ( The Cancer Genome Atlas Research Network . 20121 ). Profiling studies 
have been extensively conducted searching for its prognostic factors. In this section, we analyze the 
TCGA (The Cancer Genome Atlas) data on the prognosis of lung squamous cell carcinoma. The 
TCGA data were recently collected and published by NCI and have a high quality. The dataset 
we analyze was downloaded in April of 2015. 

The prognosis outcome of interest is overall survival. Multiple environmental and clinical vari¬ 
ables are available for analysis. We first remove variables with a low quality of measurements (for 
example with a high missing rate). We then select the following variables for analysis: age, gen¬ 
der (female is coded as baseline), smoking pack years, and smoking status (non-smoker, reformed 
smoker for more than 15 years, reformed smoker for less than or equal to 15 ye ars, cu r rent s moker; 
cod ed as 0, 1, 2 and 3). These variables have been suggested in the literature (jMiller et al.1 (|2004l ) 
and lNa.ka.chi et al l d 199ll )). A total of 422 samples have clinical and environmental measurements 
available. 

For the genetic part, we analyze gene expression data. A total of 18,969 measurements are 
available for 502 samples. When matching the clinical/environmental data with genetic data, 
complete data are available for 404 samples. Among them, 129 died during following. The median 
followup was 30 months. 275 were censored, and the median followup was 18 months. 

The tuning parameter can be chosen using data-driven methods for example cross validation. 
However, we note that the commonly used methods are mostly based on the notion of prediction. 
With a large number of marginal models, the goal is to identify markers top-ranked in a marginal 
sense, not prediction. Thus, following published studies, we vary the tuning so that a prefixed 
number of interactions are selected. In Tabled we provide results on the 33 top-ranked interactions. 
Longer or shorter lists of identified interactions are available from the authors. After the interactions 
are identified, we refit the marginal models without penalization on the main effects to satisfy the 
“main effects, interactions” hierarchy. Beyond the proposed method, we also apply the three 
alternatives considered in simulation. Table [4] in Appendix suggests that different methods identify 
significantly different genes and interactions. Results under the proposed method are provided in 
Table [3j Those under the alternative methods are provided in Appendix. 

To assess the stability of our findings, we apply a cross validation-based approach. Specifically, 
one sample is removed from data, and the proposed method is applied. This step is repeated over 
all samples. For each identified interaction in Table 3, we compute its probability of being identified 


















in the reduced datasets. It can be seen that three interactions have very small stability measures. 
All other interactions have stability measures close to 1. We have also examined those interactions 
not identified and found that their stability measures are all close to 0. This analysis suggests a 
certain degree of stability of the proposed method. 

Literature search suggests that the identified int eractions and co rresponding genes may have 
important implications. Specifically, previous studies ( Shi et ah . 2014 ) have suggested that the ma¬ 
jor GxE interactions occur between genes and smoking status. Among the identified gene-smoking 
interactions in the current study, the CEBPB (CCAAT/enhancer-binding proteins beta) protein is 
a transcription factor that works with other CCAAT/enhancer-binding protein family members in 
the regulation of cell cycle progress, differentiation and pro-inflammatory gene expression. CE BPB 


has b een found to be unregul ated by tobacco smo ke in both human lung fibroblast (jMiglino et al. 


2 0121 ) and mice emphysema (iHirama et a h. 120071 ). Another gene that interacts with smoking is 
EFNA1 (a.k.a. ephrin-Al). A recent in vitro study found that ephrin-Al is o verexpressed in 


tobac co smoke treated bronchial airway epithelial cells compared to control cells ( Nasreen et al. 
20141 ). In addition, the elevated expressions of ephrin-Al are positively associated with tumor 


proliferative capacity in non-small cell lung carcinoma patients ( Giaginis et al. . 2014] ). The gene 


that shows both a strong main effect and interaction with duration of smoking is TERF1 (telorn- 
eric repeat bindin g fac tor 1) . The TERF 1 exp ression levels have been found to be decreased in 
lun g cancer (Lin et al.1 (l2006l). Jian et al. (EoOS)), as well as other types of cancer in several stud- 
( Yamada et al. ( 20021 ). Mivachi et al. ( 2002 )). It functions as an inhibitor of telo merase a nd 


les 


is ide ntified as a prognostic marker for overall survival in non-small cell lung cancer (l.lian et al. 


20091 ). The molecular mechanism of why and how TERF1 decreases in the process of cancer is not 
clear. Our results suggest that smoking can be one of the factors. Other than smoking duration, 
we also find that several genes interact with smoking intensity as well. The gene that interacts with 
both gender and smoking intensity is KATNB1. KATNB1 encodes protein katanin p80 subunit 
Bl, which has been found to participate in cytokinesis by interacting with tumor suppressor gene 
LAPSER1. T he disr up tion of cytokinesis process may potentially cause genetic instability and 
cancer (jSudo fc Marul . 20071 ). In addition to smoking, we find eight genes interact with gender in 
Table [3l Among these, STRADB and CA5BP1 draw our attention. STRADB is an important 
gene in lung cancer progression and metastasis through the activation of LKB1. LKB1 is essent ial 
for G1 cell cycle arrest, cell polarity and stress, cell detachment and adhesion ([Ma rcus & Zhou, 
20ld ). The STRADB encoded protein also interacts with the X chromosome-linked inhibitor of 


apoptosis protein by enhancing its anti-apoptotic activity. In addition, gene CA5BP1 is located on 
X chromosome and found to be gender-associated. 


Discussion 

GxE interactions have important implications for the prognosis of a large number of complex 
diseases. In this study, we have developed a novel interaction analysis method. It is robust to 
contamination in the prognosis outcome, which is not rare in practice but cannot be accommodated 
using most of the existing methods. In addition, we adopt penalization for identifying important 
interactions. This strategy differs from the commonly adopted significance level-based identification 
and can have better marker identification accuracy. Significantly advancing from most of the 
existing interaction studies, the consistency property has been rigorously established under the 
ultrahigh dimensional setting, making the proposed method one of the very few with a strong 
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theoretical basis. An effective computational algorithm has been developed. In simulation study, 
the proposed method is observed to have satisfactory performance. With contamination, robust 
methods are needed. Under the majority of the cases, the proposed method outperforms the 
quantile regression based method. Thus, it provides a useful alternative to the quantile regression- 
based and other existing methods. In the analysis of lung cancer data, the proposed method 
identifies meaningful genes and interactions, which differ from those using the alternatives, with 
satisfactory stability. 

We choose the popular Lasso penalty because of its low computational cost and satisfactory 
performance. A limitation of Lasso is that its results may not respect the “main effects, interactions” 
hierarchy. However, as shown in Appendix, the proposed method can consistently identify the 
important interactions. As demonstrated in data analysis, once the important interactions are 
identified, if the hierarchy is desired, the main effects can be added back. With a robust loss 
function, the proposed method is computationally more expensive. However, as the computing 
can be executed in a highly parallel manner, and with the effective coordinate descent algorithm, 
the computer time is very much affordable. A total of 70 scenarios are considered in simulation 
and show the satisfactory performance of proposed method. More comprehensive comparisons of 
different methods, especially robust methods, have not been conducted in the literature and are of 
interest to future studies. Interesting findings are made in the lung cancer prognosis data analysis. 
Unfortunately there is still a lack of objective way of determining which set of results is more 
sensible. Validation of the findings will be pursued in future studies. 
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Table 1: Simulation: AUCxlOO (sd) based on 100 replicates, p = 500. 


Error 

Method 

Independent 

AR(0.2) 

AR(0.8) 

Band(0.3) 

Band(0.6) 

N(0,1) 

Robust 

86.1(7.7) 

90.1(6.6) 

86.7(6.9) 

89.2(5.7) 

94.3(5.8) 


Unrobust 

88.9(6.6) 

88.1(5.7) 

86 .8(6.1) 

83.6(6.3) 

91.0(8.2) 


Stute 

65.7(4.4) 

63.3(3.3) 

61.6(4.6) 

64.7(3.5) 

61.0(2.3) 


Quantile 

82.9(3.3) 

86.3(1.2) 

93.5(2.1) 

88.9(2.5) 

76.5(5.1) 

0.95N(0,1)+ 

Robust 

91.1(7.8) 

87.1(6.9) 

88.7(5.1) 

87.7(6.6) 

88.1(6.5) 

0.05Cauchy 

Unrobust 

82.8(8.6) 

85.9(10) 

88.7(8.5) 

81.9(12.6) 

78.5(10.2) 


Stute 

68.0(4.7) 

60.1(4.6) 

70.6(4.8) 

64.7(3.5) 

73.3(4.7) 


Quantile 

82.7(3.1) 

83.4(3.4) 

92.4(2.9) 

87.4(2.5) 

71.9(3.9) 

0.85N(0,1)+ 

Robust 

86 .6(8.2) 

89.2(7.0) 

93.9(6.9) 

86.8(6.9) 

83.5(5.8) 

0.15Cauchy 

Unrobust 

77.1(9.4) 

85.0(10.2) 

87.1(12.7) 

72.8(10.3) 

76.2(11.6) 


Stute 

64.2(6.3) 

63.7(5.5) 

67.5(6.9) 

64.7(3.5) 

67.8(6.4) 


Quantile 

81.9(3.1) 

84.2(1.7) 

93.6(2.7) 

89.4(2.4) 

69.1(5.1) 

0.7N(0,1)+ 

Robust 

84.8(6.9) 

88 .6(6.6) 

87.9(5.9) 

89.2(5.7) 

90.1(5.9) 

0.3Cauchy 

Unrobust 

71.7(11.8) 

75.1(13.3) 

77.8(12.8) 

80.8(6.5) 

86.3(8.2) 


Stute 

64.5(8.2) 

59.4(5.8) 

65.7(8.5) 

64.7(3.5) 

64.4(6.6) 


Quantile 

80.1(2.7) 

85.8(2.3) 

92.5(2.4) 

88.2(2.9) 

68.1(8.3) 

0.95N(0,1)+ 

Robust 

80.8(7.3) 

89.4(6.1) 

93.4(5.0) 

69.9(5.4) 

88.8(6.7) 

0.05t(3) 

Unrobust 

76.7(9.1) 

84.7(8.4) 

89.6(9.0) 

82.1(11.6) 

85.4(10.1) 


Stute 

60.8(4.5) 

66.7(4.9) 

68.6(5.0) 

64.7(3.5) 

73.8(4.9) 


Quantile 

82.6(2.1) 

85.4(2.6) 

89.7(2.5) 

87.8(2.2) 

73.4(6.6) 

0.85N(0,1)+ 

Robust 

87.5(7.5) 

85.6(6.6) 

90.6(6.0) 

87.9(6.5) 

79.8(5.8) 

0.15t(3) 

Unrobust 

82.3(11.7) 

79.4(10.6) 

83.5(11.8) 

75.5(13.9) 

75.5(12.2) 


Stute 

67.4(5.4) 

61.4(5.1) 

68.6(5.0) 

64.7(3.5) 

68.2(4.4) 


Quantile 

83.1(3.4) 

85.3(3.1) 

92.4(1.6) 

87.7(3.6) 

72.7(4.1) 

0.7N(0,1)+ 

Robust 

84.4(7.2) 

88 .6(6.8) 

86.3(6.7) 

88.4(5.3) 

85.4(5.3) 

0.3t(3) 

Unrobust 

71.9(11.7) 

71.5(12.3) 

76.0(13.1) 

80.7(6.4) 

80.0(4.8) 


Stute 

62.3(8.4) 

62.0(7.4) 

68.6(5.0) 

64.7(3.5) 

60.6(6.9) 


Quantile 

79.8(4.1) 

83.9(2.6) 

93.3(1.9) 

87.1(3.1) 

68.5(6.3) 
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Table 2: Simulation: AUCxlOO (sd) based on 100 replicates, p = 1,000. 


Error 

Method 

Independent 

AR(0.2) 

AR(0.8) 

Band(0.3) 

Band(0.6) 

N(0,1) 

Robust 

89.2(6.8) 

87.4(6.7) 

94.0(5.0) 

92.8(7.2) 

85.4(5.3) 


Unrobust 

84.4(6.1) 

87.9(6.3) 

91.7(4.5) 

86.8(7.3) 

80.0(4.8) 


Stute 

64.7(6.0) 

66.7(4.2) 

68.6(5.0) 

76.5(4.9) 

73.4(4.0) 


Quantile 

82.4(2.2) 

85.1(3.1) 

92.8(4.2) 

88.7(2.7) 

71.1(3.9) 

0.95N(0,1)+ 

Robust 

86.0(7.4) 

76.4(6.1) 

87.4(5.1) 

89.3(6.2) 

94.5(5.7) 

0.05Cauchy 

Unrobust 

82.0(8.9) 

83.9(10.6) 

88.4(5.5) 

80.0(11.6) 

91.2(7.2) 


Stute 

62.2(4.6) 

64.6(5.9) 

68.6(5.0) 

65.4(4.1) 

71.5(4.6) 


Quantile 

82.3(3.4) 

85.9(2.3) 

90.6(2.1) 

88.4(1.9) 

70.9(6.6) 

0.85N(0,1)+ 

Robust 

85.4(6.8) 

93.4(7.2) 

88.3(7.2) 

88.1(5.5) 

92.4(6.2) 

0.15Cauchy 

Unrobust 

78.4(10.5) 

83.1(11.4) 

84.8(8.2) 

75.5(10.8) 

83.0(12.5) 


Stute 

65.8(7.0) 

72.2(6.2) 

67.5(6.7) 

61.2(5.1) 

70.4(7.3) 


Quantile 

82.1(3.1) 

85.7(1.9) 

92.5(2.3) 

87.6(3.1) 

69.3(5.8) 

0.7N(0,1)+ 

Robust 

88.0(7.4) 

87.5(7.5) 

90.7(5.5) 

92.8(6.1) 

82.7(6.4) 

0.3Cauchy 

Unrobust 

72.7(10.4) 

74.4(14) 

89.3(7.5) 

92.5(8.9) 

74.4(10.5) 


Stute 

58.2(6.2) 

64.9(7.8) 

68.6(5.0) 

63.0(4.9) 

62.4(6.6) 


Quantile 

82.3(2.6) 

86.3(2.7) 

92.4(3.1) 

88 .6(2.1) 

70.6(3.4) 

0.95N(0,1)+ 

Robust 

73.4(5.8) 

87.7(6.5) 

92.5(5.1) 

78.8(5.6) 

90.6(6.3) 

0.05t(3) 

Unrobust 

78.9(7.5) 

85.8(9.9) 

87.6(10.0) 

80.7(9.4) 

89.0(8.4) 


Stute 

66.2(4.8) 

68.9(5.5) 

68.6(5.0) 

68.3(5.1) 

68.9(4.2) 


Quantile 

80.4(3.1) 

83.1(4.1) 

89.9(2.4) 

86.6(4.1) 

69.4(7.0) 

0.85N(0,1)+ 

Robust 

83.0(7.5) 

84.7(8.2) 

82.6(5.8) 

79.4(7.0) 

86.4(6.2) 

0.15t(3) 

Unrobust 

74.5(9.8) 

76.6(11.2) 

74.7(12.5) 

76.4(13.6) 

77.5(10.0) 


Stute 

57.2(5.3) 

64.6(6.2) 

68.6(5.0) 

67.6(5.4) 

65.8(6.1) 


Quantile 

81.6(3.8) 

83.9(3.3) 

82.3(2.4) 

86 .6(2.2) 

69.1(4.9) 

0.7N(0,1)+ 

Robust 

85.2(7.3) 

90.0(6.7) 

85.8(5.6) 

94.8(5.3) 

81.5(5.9) 

0.3t(3) 

Unrobust 

75.6(11.4) 

75.2(10.3) 

81.8(6.4) 

83.0(5.3) 

74.7(11.3) 


Stute 

57.2(6.2) 

64.8(6.6) 

68.6(5.0) 

63.3(7.4) 

62.0(5.9) 


Quantile 

79.7(2.7) 

83.6(1.6) 

91.9(2.3) 

87.9(1.9) 

68.3(6.1) 
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Table 3: Analysis of the lung cancer data using the robust method: estimates xlOO. For the interactions, values in “()” are the 
stability re sults. _ 

Main effects Interactions 



age 

gender 

intensity 

status 

gene 

age gender 

intensity 

status 

AP1S2 

-1.0 

-22.6 

-0.1 

37.4 

20.8 


-0.2(0.998) 


BTD 

0.8 

-8.6 

0.2 

5.1 

-56.7 



15.9(1.000) 

C10ORF54 

0.7 

2.7 

0.2 

6.2 

3.2 


-0.5(0.998) 


CA5BP1 

-1.2 

-6.3 

0.1 

29.0 

7.5 

14.8(1.000) 



CAPN1 

-3.5 

-8.9 

0.3 

-8.6 

-50.0 

-24.7(0.000) 



CEBPB 

-1.0 

-15.5 

0.0 

29.4 

-41.6 



8.1(0.998) 

EFNA1 

-0.7 

5.9 

1.0 

4.0 

-65.3 



12.7(1.000) 

FAM107B 

-1.2 

-19.3 

-0.1 

16.2 

29.2 


-0.8(1.000) 


FLRT3 

-1.2 

24.6 

0.8 

6.1 

33.4 


-1.5(0.998) 


KATNB1 

-0.0 

0.5 

0.2 

26.6 

-76.2 

39.8(0.995) 

30.6(0.059) 


LRRC1 

-3.6 

-13.6 

-0.3 

64.5 

25.5 

-24.3(0.995) 



LYRM5 

0.8 

-10.2 

0.1 

4.4 

-12.0 

-6.8(1.000) 



AP1S2.1 

-1.0 

-22.6 

-0.1 

37.4 

20.8 


-0.2(0.998) 


MY018A 

1.5 

-15.9 

0.1 

-7.8 

16.3 


0 .2(1.000) 


NODI 

-1.3 

16.0 

-0.1 

4.6 

-28.9 


-0.3(0.998) 


NPLOC4 

-3.3 

-34.8 

0.7 

58.4 

-76.3 



14.4(0.998) 

PLEKH02 

0.2 

11.2 

0.1 

1.1 

-15.3 


-0.5(0.998) 


POLR3GL 

-4.3 

-1.7 

0.0 

63.0 

28.2 


0.1(0.998) 


RAB27A 

0.6 

-12.9 

0.0 

10.1 

14.4 


-0.5(0.998) 


SECISBP2L 

-3.7 

-34.7 

-0.5 

55.3 

30.4 


-0.8(0.998) 


SGTB 

2.0 

-17.5 

0.2 

3.3 

7.4 


0.2(0.995) 


STRADB 

-0.7 

-3.4 

0.1 

4.1 

1.2 

-48.5(0.995) 



SWSAP1 

1.9 

-14.6 

0.1 

-2.9 

-12.1 



0.8(0.995) 

TEP1 

1.7 

-33.3 

0.3 

-0.9 

5.4 


0.3(1.000) 


TERF1 

0.2 

-22.0 

-0.0 

43.2 

46.4 



-18.9(1.000) 

THOC1 

0.9 

30.9 

1.2 

-0.8 

31.4 

-67.4(0.002) 



TIGD5 

1.5 

-11.6 

1.1 

-32.5 

3.9 



-10.4(1.000) 

TK2 

-1.0 

-13.8 

-0.4 

25.0 

24.7 


-0.8(1.000) 


TMEM54 

3.7 

-40.0 

0.5 

-33.2 

-6.1 

-69.2(1.000) 



TMEM106A 

-0.8 

-11.2 

0.2 

20.4 

4.1 


-0.3(0.998) 


TOMM7 

1.8 

0.1 

0.1 

-7.7 

-21.4 



2.3(0.998) 

TRIM34 

0.4 

5.5 

0.2 

2.1 

-13.4 


-0.3(0.998) 


YARS2 

0.5 

8.1 

0.9 

-9.7 

-5.8 



-2.5(0.998) 







TPR 



Figure 1: An illustration of the ROC curves for the proposed and alternative methods. 


16 






Appendix 


Consistency properties 

Here we rigorously prove that the proposed method can consistently identify the important interac¬ 
tions under ultrahigh-dimensional settings. With the consistency properties, the proposed method 
can be preferred over the alternatives whose statistical properties have not been well established. 

First we show that, under mild conditions, the proposed method can distinguish the important 
effects from unimportant ones in the presence of censoring. For each j E {1,2, • • • ,p}, define the 
population version of the marginal estimate as 

Cf = arg max^. E{exp(— (T - Uj Cjf/O)}, 

where E denotes expectation under the true model. Denote the kth element of Cj by Cj,k- The 
corresponding important covariate effect index set in Cf 1 is labeled as Sj = {k E {1,..., 2q + 2} : 
Cfl A 0}. Denote Ax = U^ =1 {t : t E Sj , 2 < t < q + 1} as the important set with its corresponding 
environmental variables important in at least one marginal model. If q + 2 E Sj, then the jth gene 
is associated with the disease outcome in a marginal sense. The set {t : t E Sj, q + 3 < t < 2q + 2} 
contains important interactions between the jth gene and environmental variables. Then we have 
the important gene set Ac = U P j=\{j '■ 9+2 E Sj} and interaction set Aj = U^ =1 {(j — l)p+t — q — 2 : 
t E Sj, <7+3 < t < 2g+2}. Denote S c and |S| as the complement and cardinality of set S respectively. 
Denote the kth element of Ujj by U(ij) k , k = 1,..., 2q + 2. 

For each j E {1,2, • • • ,p], define 

D n(Cj) = p(-(yi - nJjCjf/e ) 2 ^ 1 u id 

i =1 

and 

2 n 

4(C j) = g exp(—(y, - u ljCjf/0) 

i— 1 

Let Ty, Ty and tc be the end points of the support of Y, T and C, respectively. Assume the following 
regularity conditions: 

Cl. The observations {(yi, <5«, Xj, Zj), 1 < i < n} are independent and identically distributed; 

C2. T and C are independent and P(T < C\T, X, Z) = P(T < C\T)\ 

C3. Ty < re or tt = tc = oo; 

C4. q is a fixed. For each j E {1,2, ••• ,p}, y/nD n (Cf 1 ) ~^d N(0,T, j), where Sj is a positive 
definite matrix. converges to some negative definite matrix I(Cj^) in probability. 

Moreover, the smallest eigenvalue p* = min,- p m i n (—F(Cf f )) an d the largest eigenvalue p* = 
maxj Pmax(Sj) are bound away from zero and infinity. 

C5. Let Nj denote a sufficient small neighborhood centered at Cj 1 ■ For C},Cj € Nj, there exists 
a bounded constant V such that Cj [/(Cj) — I(Cj )]Cj + F||Cj — C] lb with any ||CjII 2 = 1- More- 


2 (Vi ~ u LCj) 


- 1 




1 




U {i,i)k 1 U {ij)k 2 ) — 


over, For any k 2 G {1,..., 2g+2}, -E(exp(-(y—uAC/)/ 6 *) 
J, where J is finite and Q* eNj. 


2 (y i -uJ i q) 2 


- 1 


Remark 1. C1-C3 have been commonly assumed in models with random cen sorin g. Condition 
Cl is mild and has been proved under some regular conditions in Stute hi 99 A) and Huang et al. 
(200j ). C5 is added to simplify the proof. 

If the truly important effects were known, then we would be able to compute the oracle esti¬ 
mator. Consider the oracle estimation Q with £59 = 0 and 


C Sj = arg max exp (-(?/* - Cj&f/O) ~ A ^ |Cy,fc|- 


(7) 


i— 1 


keSj 


Theorem 1. Consider the estimator defined in &■ Under Condition C1-C5 and 6(n K + I2p* 1 (q+ 
1)A)R < p* ; we have 


Pr f max max IC 7 s — Cfi I > n K + 12 qp X A 
\i <j<pseSj' J ’ J ’ s ~ 


< p exp — 


pin 1 2k + 144<7 2 nA 2 


36p* 


+ 4 p(q + 1) exp - 


np* 


36J(q + l) 2 


where k < 1/2. 

The tail probability in Theorem [I] is exponentially small. In other words, the proposed method 
is able to accommodate ultrahigh dimensional data with 

logp = o(n 1 ~ 2K + nA 2 ). 


Recall that Cj = argmax^ gR 2 ,+ 2 i(Cj), where 


L (Cj ) = ^Wiexp(-( yi - U ljCjf/6) - AllCilli. (8) 

i =1 


Since in (JHJ) is concave, if we can show the oracle estimator Q satisfies the Karush-Kuhn- 

Tucher(KKT) condition, then Q = Qj. 

Define Ac = A p =1 {j : Cj , q +2 / 0} and Al/ = C p =1 {(j-l)p+t-q-2 : f j)t / 0,q+3 < t < 2q + 2}. 
Partition according to Sj as 



IsM cf) ls jSj < Cf) \ 

Is/sfC;- 1 ) IsqsqiCf) ) ' 


The following theorem establishes that the proposed method has the asymptotic consistency prop¬ 
erties. 


Theorem 2. Assume Condition C1-C4, k < 1/2 and = \\Isj c Sj(Cj I )^S j Sj{Cj I ) 1 ||oo < K < 1. 


2 
















If Oq Q -Ag and Oi C Ai, we have 

Pr (o a C 35 and O, C A,) > 1-0 (pexp + ^ f 


n\ 2 (l — Kf\\ 
2p*{l + Kyj) ' 


Below we provide proofs for the two theorems. 

Proof of Theorem [T| 

Recall that 


C j,Sj = argmax^Wjexp(-(yj - u Ji,j) s Cj,Sj) 2 /&) - A ^ |Cj,fc|- 


(9) 


i= 1 


fees,- 


Denote the above objective function as R n (Cj,Sj)- 

First, let rj = n~ K + 12 p~ 1 (q + 1)A and p = J2 exp —- +1 ^^ +1 1 n — ; where k < 1/2. 

To prove that 


Pr 


{llO,Sj ~ Cjfsjh < r 3>j = 1. ■ ■ ■ >p} > 1 - v, 


it suffices to show that 


Pr ( sup R-niCj-S,) < Rn(&)J = !i ’ ‘' ,p) > 1 - V, 
V <b,s,ex y 


( 10 ) 


where X = |C/,Sj : IK j,Sj ~ Cj'sjk = r j , j = !,••• ,pj. This implies that with probability at least 


-M 

U1CH, oauoura 11^1*1 

Recall the definitions of D n (Q) and I n (Cj )• Partition them according to Sj as 


1 - 77 , Rn{Cj,Sj) has a global maximizer Cj,Sj that satisfies ||Cj,s 3 - - Cys, II 2 < r j> f° r 3 = 1, ■ ■ ■ ,P- 


oiiu -‘nVSJ 

Dn,Sj(Cj) 


DMi) - ( aS(C;) 


, J»(Cf) = 


In,SjSj(Cj ) In,SjS?(Cj) 
' J> 1 In,S?Sj(Cj) In,S?S?(Cj) 


Obviously, 


2(/h 0,5,) 


Dn,Sj{Cj) = ^w;exp(-(y; - 


-u 




2=1 


and 


z % \ 

In&SiiCj) = ^E w * ex p(-(:x - u Jj) s , 0 ,S 3 ) 2 /^) 


2 (^- u (iJ )s .0,S,') 2 


2=1 


1 I u (i,i)s i u (i J j)s 3 . ■ 


In fact, by Taylor’s expansion we have 

-M 


(Cj,Sj ) - 
n 

E { ex p(-(y* - u Jj) Si OA-) 2 / 0 ) - ex p(-(y* - u J,i) s /&)' 


2=1 


3 







( 11 ) 


-A E (lfe.1 - l<"l} 

keSj 

= D n .s,j (C fV (Cj,Sj - Cjfsj ) + 2 (O.s'j 

+ ^ (1(^1 “ I Cj,fe I } : 

keSj 


Gj?S J ) Tl n,S J S j (Cj)(Cj,S i 



where Cj lies between and 
It is easy to see that 


(0,S,- - Cft,) 7 4.5,5, (CjXCj.S, ~ C ft,) 

= (Cj, Sj - cft,) T w ef xe,,s, - Cj&) 

- Cft,) T fts,s,(Cj) - isjSji Cft)} (C,,s, ~ Cft,) 
+ (Cj,S, - Cft,) T {in.SjSjiCj) - ISjSjiCj)} (Cj,S, - Cjfsj) 
= Qi + Q 2 + Qs- 


( 12 ) 


By C4, Qi < — \\Cj,Sj ~ Cft.lli/ 0 *- Moreover, Q 2 < I ||Cj',s 3 ' — Cft-lli under C5. Bernstein inequality 
and C5 yield 


!’.•( J„,s,s,(Cj)-%?,&) 7-- >p< 215,1 exp ( - 




,12 / ’ 


9JI5.I 

whereJ| • \\ F denotes the Frobenius norm. Since A max (/ n) s,s,(C,) - ISjSjiCj)) < ||4,s,s,(Cj) ~ 
ISjSj (Cj) IIF; we have Q 3 < ^p*r|. Therefore, the second term in (flTT) can be controlled by 


- Cftft b/..s,.s. (CftXCj.s, - Cft) < ~^* r f + / T J> 

with probability at least 1 — A{q + 1 ) exp ■>a.j 1 (q+i)' 1 ) due t° and |5j| < 2 q + 2 . 

Ss,s,- S 5 .gc 


(13) 


Partition E,- according to Sj as 
and conditions C2 and C4, we have 


For D n g. (Cf )- by the definition of Cft > 


-.s-'E M.s-y.s; 


VnD U:Sj (Cj ) ~>d N(0,Z SjSj ). 

Then for any given t, an application of Bernstein’s inequality, 

Pr (l D ntSj (Cj*) T (Cj,S j ~ Cft,)I > *) < 2ex P ( “ 


nf" 




Recall that r, = n K + 12 / 0 ^ 1 ((7 + 1)A. Let t = |/?*r 2 , then we have 


Pr(Dn,S J -(Cft) T (Ci,S, - Cft,) > ^*rf) < exp 


p'jn 1 2ft + 144(g + l) 2 nA 2 


36p* 


(14) 


4 






d d 

By the Triangle inequality and (^2 \vi\) 2 < d ^ vf for any sequence v-i, we have 


2= 1 


2=1 


A E (icJSi - io,*1} < A E icfi - o,fci < A v ^ c,..s;, - cE 


5,112- 


( 15 ) 


Combining (JTT]), (fT3|) , (fl4l),(fl5l) , and 6(n K + 12p* 1 (q + 1)A)R < p*, we have 

Rn{Cj,Sj) ~ Rn{Cj? Sj ) < ~\p* r2 j + A \/N r i + < 0 

with probability at least 1 — exp - + 36 p* g+ ^ ) ~ 4(g + l)exp 

with the Bonferroni’s inequality, we have the conclusion. 

Proof of Theorem [2] 

Recall that Cj = arg max^ eR 2 9 + 2 L(Cj), where 

n 

L iCj ) = E^w-fa - EjO) 2 / 0 ) - A iiCjiii- 
2=1 

Consider the oracle estimation C- with ^59 = 0 and 


n pi 

36J(g+l) 2 


(16) 

Together 

□ 


(17) 


Ci.fi, = argmax^cj i exp(-(y i - u /6) ~ A E 1C 


I 


2=1 


keSi 


(18) 


Denote the above objective function as Rn,Sj(Cj)- Since L(Q) in (jH) is concave, if we can show that 
the oracle estimation Q satisfies Karush-Kuhn-Tucher(KKT) condition, then Q = Q. Hereafter 
we will focus on proving that the oracle estimation C, satisfies KKT condition. 

Next we want to show that 


l^ n (*-b)loo < A ’ 1 2 j ‘ ‘ ‘ ,P 
where \v\oo = max, \u t \ for any vector v = (zq, • • • , ^| 59 |) and 


ttn(Sj) = ^>exp(- 


& -q.sj) nm %, j)Si Cj, Sj ) 


2=1 


e ' e 

Applying Taylor’s expansion, we have 

(w-u5j) flj C&) 2 1 2(|/ < -uJ J)sj C&) 


u (hi)s?' 


^Ki(SJ) = y cuj exp 


2=1 


0 


0 


" u (*,i)s<? 


(v* ~ u (Es, to) 2 1 / 2( yi - u ( '.Ci.fi,) 


T 


Z --v 

+«E 


u>i exp 


- 1 


i =1 


(19) 


5 










( 20 ) 


x u o,% u oj) S j . (to ~ W 

:= r n + A n , 

where Cj lies between C,^ 1 and Q. From the proof of Theorem 1, we have 

Cj,Sj - Cj*Sj = ln.S;S,(Cj') '{ D n>S .{Cf) + Asgll(Cj' / )}. 
Hence substituting (I2TT) into (1201) . we obtain 

fa - u h)sA s ^ 2 ( 2{yi ~ u h)sM )2 


( 21 ) 


A n = 


i=i 


exp 


e 

»M \-1 


- 1 


X ^{-DnMCr) + Asgn«f)} 

= -In,S j c S j (Cj)In,S j S j (Cj I )- 1 D nt S j (Cj I ) + ^In,Sj c Sj (CjKn.Sj (Cj' / ) _ 1 Sgn(Cj' / )- 
Next we define 


( 22 ) 


a; = -/^(cn'sjsAcn '"n-sAtn+ A^ccr) wcfrv^cr i 

and 0*(5p = T n + A*. From the proof of Theorem 1, we find the tail probability for I n (Cj ) Is 
dominated by that for D n (£j). Thus 

Prflfi^L > A ) ~ Pr (K(^)L > A). 

Therefore, combining the above discussion, we only need focus on H* (5^). In fact, 

l^(^ c )|oo<|r n |oo + |A;| 00 

< |Ai,s?(C, m )|oo + Is ,' .s, i Cf 1 ) Is, s, ( Cf') I A, ..s, i C/ / 11 -x. 


+ A|/ 5 ,c5,(Cf)/5,5,(Cf)- 1 ^ 5 ,(Cf)sgn(Cf 
^(Cf)|oo + ||/ S -S,(Cf)/5 J S J (Cf)- 1 ||oo|^, l ^ / ioo , 

lb' the condition T, = I s ,"S ,K;' ) > s,s,((j’) ! 


< \Dn(Cf ^)|oo T (Cf ^)ISjSj ) _1 ||oo|^n(Ci^)|oo T A11Tcj■ c S' 7 - (Cj^ ) _1 ||oo(23) 


racf)ioo<A 


< K < 1, if 
1 - <I>; 


1 + T, ’ 


(24) 


then from (I23|) . it follows 


K(S, C )|oo < Pn(Cf)loo(l + ^) + A^- 


< A(1 - <Pj) + A= A, 

which proves (119ft . We now derive the probability bounds for the event in (|24ft . Similarly as the 


6 







derivation of (fbH) 


nA 2 (l-^) 2 \ 
2p*(l + $ j ) 2 J ' 


Pr 11- DniCf) loo > X \ + 1 J ] < 2 exp 
By the Bonferroni’s inequality, we obtain 


Pr 


\ D n(Cf)\oo < + ^ 




nA 2 (l — d>j) 2 \ 

2 p *( l + <£>,,•) 2 / 


(25) 


(26) 


Based on the above result, the theorem is proved. 


□ 
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Additional numerical results 


Table 4: Summary analysis results for the lung cancer data using different methods. Diagonals 
are the numbers of identified genes using different methods. Off-diagonals are the numbers of 
overlapping genes. In “()” are the numbers of overlapping interactions. 



Robust 

Unrobust 

Stute 

Quantile 

Robust 

33 

9(5) 

0 (0) 

0 (0) 

Unrobust 

- 

31 

3(0) 

1 (0) 

Stute 

- 

- 

30 

2 (1) 

quantile 

- 

- 

- 

28 




Table 5: Analysis of the lung cancer data using the unrobust method: estimates xlOO. 

Main effect Interaction 



age 

gender 

intensity 

status 

gene 

age gender 

intensity 

status 

BCL10 

5.5 

79.8 

0.3 

105.6 

10.3 



-29.2 

BTD 

6.1 

38.4 

0.9 

91.5 

-187.4 



94.8 

CAPN1 

6.1 

77.3 

-0.0 

97.9 

22.1 



-36.2 

CASK 

6.4 

59.1 

0.4 

81.9 

90.5 



-63.1 

CSNK1G2 

5.9 

31.0 

0.5 

98.0 

-20.5 

-70.4 



DOCK6 

5.7 

100.0 

0.4 

85.6 

6.5 



-35.4 

ECI2 

7.3 

42.0 

1.1 

39.6 

-87.0 



57.3 

ELM03 

5.1 

88.5 

0.1 

122.5 

-34.5 



-12.1 

FAM83H 

5.6 

57.3 

0.5 

116.1 

113.1 



-72.0 

FASN 

6.0 

53.2 

0.7 

93.0 

117.7 



-78.4 

KATNB1 

6.2 

59.6 

0.4 

85.0 

55.6 



-49.1 

LYRM5 

7.4 

35.4 

0.9 

40.4 

-43.4 

12.2 

32.8 


MACROD1 

5.8 

73.2 

0.3 

99.1 

73.1 



-58.4 

NACC2 

6.1 

65.9 

0.3 

92.6 

100.7 



-70.9 

PKP3 

5.8 

73.9 

0.1 

99.0 

43.3 



-54.8 

RBFA 

5.8 

62.4 

0.1 

104.4 

10.9 

-66.1 



RNH1 

6.0 

70.6 

0.5 

79.8 

14.9 



-32.7 

SCYL1 

5.5 

84.1 

0.4 

105.6 

-36.2 



-6.3 

STRADB 

6.1 

59.7 

0.9 

76.1 

-108.6 



62.6 

SWSAP1 

5.8 

69.7 

0.4 

92.9 

18.0 



-30.6 

TEN1 

5.6 

66.3 

0.6 

104.1 

21.9 



-34.3 

TIGD5 

5.5 

46.3 

0.8 

113.7 

125.3 



-73.9 

TMEM54 

6.1 

54.6 

-0.0 

101.4 

-1.3 

-81.0 



TNIP2 

5.1 

85.9 

0.6 

100.9 

-50.6 



-4.2 

TOLLIP 

6.3 

68.2 

0.5 

73.2 

39.8 



-48.4 

TTC22 

5.6 

72.3 

0.1 

114.0 

-20.3 

-51.8 



WASH2P 

5.7 

25.2 

0.3 

109.1 

19.9 

-129.7 



YARS2 

8.1 

18.8 

1.1 

13.9 

-41.6 



26.5 

YIPF2 

5.7 

75.2 

0.4 

99.8 

7.8 



-27.4 

ZNF512B 

6.3 

54.7 

0.4 

89.8 

94.9 

-45.3 

-49.6 


ZNF699 

5.8 

93.7 

0.5 

86.0 

59.1 



-56.4 







Table 6: Analysis of the lung cancer data using the Stute method: estimates x 100. 

Main effects Interactions 



age 

gender 

intensity 

status 

gene 

age gender intensity status 

KLHL9 

8.1 

17.9 

1.0 

29.2 

-358.5 

5.4 

TPD52L2 

7.5 

56.8 

1.0 

31.2 

313.5 

-4.7 

TMEM129 

7.7 

21.9 

1.1 

28.5 

-522.3 

7.4 

PCGF3 

8.2 

20.2 

1.1 

15.9 

-527.4 

7.6 

XPNPEP1 

7.8 

30.0 

0.9 

26.1 

-544.6 

7.9 

FDPS 

7.6 

24.8 

1.1 

37.0 

254.2 

-3.8 

GFOD2 

7.3 

24.7 

1.0 

40.8 

-447.1 

6.1 

MAGED4B 

7.3 

23.3 

1.0 

44.8 

-654.5 

9.2 

PIGG 

8.0 

11.6 

1.0 

27.6 

-354.2 

5.1 

UVSSA 

8.3 

21.3 

1.0 

17.1 

-527.7 

7.7 

LAMTOR2 

7.4 

37.0 

1.0 

36.8 

263.9 

-3.7 

CSNK1G2 

7.2 

26.4 

0.9 

44.5 

-390.9 

5.2 

POLR3C 

7.7 

24.1 

1.2 

26.7 

230.6 

-3.3 

CTBP1 

7.9 

17.1 

1.1 

24.4 

-361.4 

5.2 

PRUNE 

7.6 

26.2 

1.1 

33.6 

258.6 

-3.7 

NELFA 

7.6 

28.8 

1.0 

32.4 

-464.8 

6.6 

MAEA 

7.8 

14.9 

1.0 

33.5 

-418.3 

6.0 

RPS27A 

7.7 

33.9 

1.0 

27.1 

401.0 

-5.8 

TBC1D14 

7.6 

21.4 

0.9 

34.0 

-510.9 

7.2 

MAN2B2 

7.7 

27.7 

1.1 

21.9 

-481.9 

6.8 

DGKQ 

7.3 

31.0 

1.1 

36.0 

-552.6 

7.7 

RBFA 

7.6 

-0.7 

0.9 

39.7 

-565.2 

7.8 

AC 0X3 

7.7 

19.0 

1.2 

20.3 

-595.1 

8.2 

TCF25 

7.6 

27.0 

1.1 

33.5 

-578.8 

8.5 

PIGC 

7.6 

27.7 

1.1 

32.6 

190.4 

-2.7 

TOLLIP 

7.2 

25.6 

1.0 

38.7 

-393.2 

4.9 

CREG1 

7.4 

28.9 

1.0 

44.0 

198.6 

-2.9 

RAB11B 

7.1 

30.2 

1.1 

46.4 

-342.1 

4.7 

CDKN2AIP 

7.8 

40.2 

0.9 

32.3 

-576.4 

8.5 

GAK 

7.9 

31.3 

1.1 

19.4 

-452.0 

6.5 







Table 7: Analysis of the lung cancer data using the quantile method: estimates xlOO. 

Main effects Interactions 



age 

gender 

intensity 

status 

gene 

age 

gender 

intensity 

status 

ARHGAP1 

8.0 

-18.2 

0.8 

16.1 

-338.5 

5.1 




B3GALT4 

7.9 

-8.1 

1.2 

14.1 

-58.9 



1.5 


BCL11A 

8.1 

-12.8 

0.8 

16.3 

9.8 





CDKN1A 

8.1 

-5.3 

0.7 

16.9 

-193.6 

2.5 




CHIC2 

8.3 

-0.3 

0.9 

-1.0 

-530.8 

7.8 

127.8 



CHST10 

8.1 

-7.3 

0.8 

16.6 

18.9 





DNM2 

8.0 

-5.0 

0.9 

13.5 

-245.4 

3.4 




FAM65A 

7.8 

5.0 

0.8 

18.5 

-220.7 

2.7 




GATAD1 

8.1 

-4.2 

0.9 

11.8 

-42.0 




19.6 

GGT5 

8.2 

-14.7 

0.8 

14.2 

-152.4 

1.7 




GRB7 

8.0 

-5.1 

0.6 

24.0 

-1.1 


66.4 



HIST1H4D 

8.0 

-13.5 

0.9 

16.7 

13.0 





LAS1L 

7.8 

-2.1 

0.8 

22.9 

-12.8 





LETM1 

7.8 

-3.4 

1.1 

15.6 

-54.2 



1.1 


MAEA 

7.8 

-3.5 

1.2 

13.2 

-36.1 



0.6 


PAPOLG 

8.0 

-16.6 

0.8 

12.5 

16.0 





PDGFA 

7.9 

2.5 

0.8 

23.1 

-240.6 

3.5 




PDGFRA 

8.0 

5.7 

0.6 

21.1 

-384.8 

4.9 

83.9 



PPARD 

8.1 

-5.6 

0.8 

13.7 

-270.3 

3.8 




PTAFR 

8.2 

-7.0 

0.8 

11.6 

-258.1 

3.7 




RXRB 

7.8 

-5.2 

1.4 

7.1 

-71.6 



1.3 


SCARNA9 

8.1 

-9.5 

0.7 

16.7 

23.7 





SLC12A4 

8.0 

-11.2 

0.8 

18.3 

-224.0 

2.7 




TMEM204 

8.1 

-2.0 

0.8 

11.5 

-317.9 

4.2 




TOLLIP 

8.0 

-6.3 

0.9 

11.7 

-336.9 

4.8 




TOMM5 

8.0 

-6.9 

0.8 

17.8 

25.2 





ZNF141 

7.5 

24.8 

1.4 

8.0 

-105.4 


65.8 

1.3 


ZNF761 

8.0 

-19.2 

1.0 

17.0 

-37.4 


83.0 









